Disable automatic cache with dask #1024
Conversation
… .values property. Added new method .compute().
# Conflicts:
#   xarray/test/test_dask.py
Well, crud. This introduces a regression where DataArray.chunk() converts the data and the coords to dask. This becomes enormously problematic later on as pretty much nothing expects a dask-based coord. [edit] fixed below
> Well, crud. This introduces a regression where DataArray.chunk() converts the data and the coords to dask. This becomes enormously problematic later on as pretty much nothing expects a dask-based coord.
I agree that this could make sense, but I'd like to understand in a little more detail. Isn't this how things work already, even before this PR?

I agree that turning an existing pandas.Index (dimension labels) into a chunked dask array is probably not desirable, but it is less obvious to me that .chunk() should not apply to other coordinates.

I am also a little concerned about how this would affect existing coordinate variables that are already dask arrays. If a coordinate is already chunked, then calling .chunk() with a new chunk size should probably change it.

Either way, any changes here should be documented under "Breaking changes" in what's new.
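For concreteness, a minimal sketch of the behavior in question (the coordinate names here are invented for illustration):

import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={"x": [10, 20], "lat": (("x", "y"), np.zeros((2, 3)))},
)

# The open question: should chunking the data also turn the index coord "x"
# and the non-index coord "lat" into dask arrays, or leave them in memory?
chunked = da.chunk({"x": 1})
print(type(chunked["lat"].variable.data))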
if v.chunks is not None:
    new_chunks = list(zip(v.dims, v.chunks))
    if any(chunk != chunks[d] for d, chunk in new_chunks
           if d in chunks):
        raise ValueError('inconsistent chunks')
    chunks.update(new_chunks)
if chunks:
Why should this need `chunks` to not be empty already? That seems strange (maybe backwards) to me.

I might simply make this:

for dim, size in self.dims.items():
    if dim not in chunks:
        chunks[dim] = (size,)
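For illustration, a standalone version of that fallback, with made-up dimension sizes and chunk tuples:

# Dimensions of a hypothetical dataset, with explicit chunks for only "x".
dims = {"x": 4, "y": 5}
chunks = {"x": (2, 2)}

# The suggested fallback: any dimension without explicit chunks gets a
# single chunk spanning its full size.
for dim, size in dims.items():
    if dim not in chunks:
        chunks[dim] = (size,)

assert chunks == {"x": (2, 2), "y": (5,)}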
If none of the data_vars use the dask backend, then you want `chunks` to return None.
I guess this method is inconsistent with `Variable.chunks`, but it currently always returns a dict. I would either skip this change or use something like my version.
What happened before this PR was that all coords were blindly converted to dask on chunk(). Then, the first time anything invoked the .values property (and something as simple as … could trigger it), they were computed and silently cached back as numpy. If you deliberately use a dask array as a coord, it won't be converted to numpy. However, I can't think of any reason why anybody would want to do that in practice.
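A minimal sketch of the contract this PR moves to (assuming a dask-backed DataArray; the pre-PR caching behavior is as described above):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4), dims="x").chunk({"x": 2})

vals = da.values       # evaluates the dask graph; after this PR the result
                       # is no longer cached back onto the variable
loaded = da.compute()  # the explicit path added here: a fully loaded copy
assert isinstance(loaded.data, np.ndarray)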
# Conflicts:
#   doc/whats-new.rst
I added the disclaimer in the release notes.
I can't reproduce the above failure: test.test_conventions.TestEncodeCFVariable.testMethod=test_missing_fillvalue.
Apologies for the delay here -- my comments were stuck as a "pending" GitHub review.

I am still wondering what the right behavior is for variables used as indexes. (These can be dask arrays, too.) I think there is a good case for skipping these variables in .chunk(), but we probably do want to make indexes still cache as pandas.Index objects, because otherwise repeated evaluation of dask arrays to build the index for alignment or indexing gets expensive.
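To make that cost concrete, a hypothetical illustration (the array size is invented):

import dask.array as darray
import pandas as pd

lazy = darray.arange(10_000_000, chunks=1_000_000)

# Without a cached pandas.Index, every label-based operation would have to
# evaluate the dask graph again just to rebuild the index:
idx = pd.Index(lazy.compute())  # a full evaluation, repeated per operation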
@@ -792,13 +806,19 @@ def chunks(self):
         array.
         """
         chunks = {}
-        for v in self.variables.values():
+        for v in self.data_vars.values():
I am concerned about skipping non-data_vars here. Coordinates could still be chunked, e.g., if they were loaded from a file, or created directly from dask arrays.
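For example, a sketch (names invented) of a coordinate created directly from a dask array:

import dask.array as darray
import xarray as xr

coord = darray.arange(4, chunks=2)
ds = xr.Dataset(coords={"c": ("x", coord)})

# If Dataset.chunks only scans data_vars, the dask-backed coordinate "c"
# is invisible there even though it is chunked:
print(ds["c"].variable.chunks)  # ((2, 2),)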
@@ -851,6 +871,9 @@ def selkeys(dict_, keys):
         return dict((d, dict_[d]) for d in keys if d in dict_)
 
+    def maybe_chunk(name, var, chunks):
+        if name not in self.data_vars:
I see your point about performance, but I think that mostly holds true for indexes. So I would be inclined to adjust this to only skip variables in self.dims (aka indexes used for alignment).

I am still concerned about skipping coords if they are already dask arrays. If they are already dask arrays, then .chunk() should probably adjust their chunks anyway.
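A standalone sketch of that rule; in the PR itself maybe_chunk is a closure inside Dataset.chunk with access to self, so the extra dims argument here is a stand-in:

import numpy as np
import xarray as xr

def maybe_chunk(name, var, chunks, dims):
    # dims: names of index variables (the Dataset's dimensions), left untouched.
    if name in dims:
        return var
    sub = {d: c for d, c in chunks.items() if d in var.dims}
    # Chunk when new chunks were requested, or when the variable is already
    # dask-backed, so an existing dask coord gets its chunks adjusted too.
    if sub or var.chunks is not None:
        return var.chunk(sub)
    return var

var = xr.Variable(("x",), np.arange(4))
print(maybe_chunk("c", var, {"x": 2}, dims={"x"}))  # chunked; "x" itself is skipped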
I've been thinking about this... Maybe the simple, clean solution is to eagerly load the coords into memory as soon as they are created.
ping - how do you prefer me to proceed?
I'm nervous about eager loading, especially for non-index coordinates. They can have more than one dimension, and thus can contain a lot of data. So potentially eagerly loading non-index coordinates could break existing use cases. On the other hand, non-index coordinates are indeed checked for equality in most xarray operations (e.g., for the coordinate merge in align). So it is indeed useful not to have to recompute them all the time. Even eagerly loading indexes is potentially problematic, if loading the index values is expensive. So I'm conflicted.
I'm going to start throwing out ideas for how to deal with this:

Option A

Add two new (public?) methods, something like … Hypothetically, we could even have options for turning this caching systematically on/off (e.g., …). Your proposal is basically an extreme version of this, where we call … automatically.

Advantages: …

Downsides: …

Option B

Like Option A, but somehow infer the full set of variables that need to be cached (e.g., in a …). This solves the downside of A, but diminishes the predictability. We're basically back to how things work now.

Option C

Cache dask.array in …

Advantages: …

Downsides: …

Option D

Load the contents of an IndexVariable eagerly. This has the most predictable performance, but might cause trouble for some edge use cases? I need to think about this a little more, but right now I am leaning towards Option C or D.
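A sketch of what Option D would mean in practice; `_in_memory` is the private flag this PR's tests use, and the outcome shown is the intended behavior rather than a settled API:

import dask.array as darray
import xarray as xr

ds = xr.Dataset(
    {"v": ("x", darray.zeros(6, chunks=3))},
    coords={"x": darray.arange(6, chunks=3)},
)

# Under Option D, the index coordinate "x" is evaluated eagerly when the
# Dataset is constructed, while the data variable "v" stays lazy.
assert ds["x"].variable._in_memory
assert not ds["v"].variable._in_memory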
I'm going to ping the mailing list for input, but I think it would be …
Option D seems indeed the cleanest and safest option, but I can see use cases where this might happen. For example, it is common for 1-, 2- or higher-dimension unstructured meshes that the coordinates x, y, z are arranged as 1-d arrays whose length equals the number of nodes (which can be very high!). See for example the ugrid conventions. I admit that currently xarray is perhaps not very suited for handling unstructured meshes, but IMO it has great potential (especially considering multi-index support) and I'd love to use it here.
Right now, xarray is not going to be a great fit for such cases, because we already cache an index in memory for any labeled indexing operations. So at best, you could do something like … I doubt very many people are relying on this, though indeed, this would include some users of an array database we wrote at my former employer, which did not support different chunking schemes for different variables (which could make coordinate lookup very slow). CC @ToddSmall in case he has opinions here. For out-of-core operations with labels on big unstructured meshes, you really need a generalization of the …
Oh yes, true!

Indeed, that doesn't look very nice.

From what I intend to do next with xarray, I'd say that extending its support for out-of-core operations to big indexes would be a great feature! I haven't seen yet how …
Yes, please do! @crusaderky I think we are OK going ahead here with Option D. If we do eventually extend xarray with out-of-core indexes, that will be done with a separate layer (not in …).
roger that, getting to work :)
Awesome, thanks for your help!
Conflicts: xarray/test/test_dask.py
Eagerly cache only IndexVariables (i.e. coords that are in dims). Coords that are not in dims are chunked and not cached.
Finished and waiting for code review
@@ -1076,10 +1101,16 @@ def __init__(self, dims, data, attrs=None, encoding=None, fastpath=False):
                             type(self).__name__)
 
+    def _data_cached(self):
+        if not isinstance(self._data, PandasIndexAdapter):
+            self._data = PandasIndexAdapter(self._data)
+        # Unlike in Variable._data_cached, always eagerly resolve dask arrays
I thought we wanted to always eagerly load IndexVariable objects into memory without caching at all? That would suggest we should put something like self._data = PandasIndexAdapter(self._data) in the constructor, and make _data_cached and _data_cast on the subclass dummy methods.
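A sketch of that suggestion, assuming this PR's internals (Variable, PandasIndexAdapter, and the _data_cached/_data_cast hooks):

class IndexVariable(Variable):
    def __init__(self, dims, data, attrs=None, encoding=None, fastpath=False):
        super(IndexVariable, self).__init__(dims, data, attrs, encoding, fastpath)
        # Resolve the underlying data (numpy, dask, or disk-backed) into a
        # pandas.Index right away, so nothing is ever lazily cached later.
        if not isinstance(self._data, PandasIndexAdapter):
            self._data = PandasIndexAdapter(self._data)

    def _data_cached(self):
        # Dummy: the constructor already put the data in memory.
        return self._data

    def _data_cast(self):
        # Dummy for the same reason.
        return self._data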
@@ -874,6 +887,9 @@ def selkeys(dict_, keys):
         return dict((d, dict_[d]) for d in keys if d in dict_)
 
+    def maybe_chunk(name, var, chunks):
+        if name in self.dims:
Actually, maybe put this logic in IndexVariable instead? We could define a chunk method that looks like:

def chunk(self, ...):
    return self.copy(deep=False)
IndexVariables to eagerly load their data into memory (from disk or dask) as soon as they're created
Changed to cache IndexVariable._data on init. Please review...
I have one minor suggestion for a test, but I'll fix that in a follow-on PR. This looks good to me, thanks!
for k, v in actual.variables.items():
    # IndexVariables are eagerly cached
    if k in actual.dims:
        self.assertTrue(v._in_memory)
This would be slightly simpler just as self.assertEqual(v._in_memory, k in actual.dims)

Thanks for your patience! This is a nice improvement. I have an idea for a variation that might make for a cleaner (less dask-specific) way to handle the remaining caching logic -- I'll add you as a reviewer on that PR.
Happy to contribute!
@crusaderky @shoyer There are still cases where dask arrays are converted to ndarrays where I think they shouldn't be: if you create a …
@@ -277,10 +277,21 @@ def data(self, data):
                 "replacement data must match the Variable's shape")
         self._data = data
 
+    def _data_cast(self):
+        if isinstance(self._data, (np.ndarray, PandasIndexAdapter)):
Should this branch not also apply to dask_array_type?
In fact, if you manually create a Variable with a dask array you'll get a LazilyIndexedArray at this point. Should this not also be kept unchanged?
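A sketch of the variant these two questions point at, assuming this PR's internals (PandasIndexAdapter and the dask_array_type tuple); whether a LazilyIndexedArray should also be kept unchanged is left open:

import numpy as np

def _data_cast(self):
    # Leave in-memory arrays, index adapters, and dask arrays untouched;
    # coerce anything else to numpy.
    if isinstance(self._data, (np.ndarray, PandasIndexAdapter) + dask_array_type):
        return self._data
    return np.asarray(self._data)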
This is a follow-up to generalize the changes from pydata#1024:

- Caching and copy-on-write behavior has been moved to separate array classes that are explicitly used in `open_dataset` to wrap arrays loaded from disk (if `cache=True`).
- Dask-specific logic has been removed from the caching/loading logic on `xarray.Variable`.
- Pickle no longer caches automatically under any circumstances.

Still needs tests for the `cache` argument to `open_dataset`, but everything else seems to be working.
@shoyer Great, thanks, I'll give that a try.
* Disable all caching on xarray.Variable. This is a follow-up to generalize the changes from #1024: caching and copy-on-write behavior has been moved to separate array classes explicitly used in `open_dataset` to wrap arrays loaded from disk (if `cache=True`); dask-specific logic has been removed from the caching/loading logic on `xarray.Variable`; pickle no longer caches automatically under any circumstances. Still needs tests for the `cache` argument to `open_dataset`, but everything else seems to be working.
* Fixes for test failures
* Fix IndexVariable.load
* Made DataStores pickle-able
* Add dask.distributed test
* Fix failing Python 2 tests
* Fix failing test on Windows
* Alternative fix for windows issue
* yet another attempt to fix windows tests
* a different windows fix
* yet another attempt to fix test on windows
* another attempt at fixing windows
* Skip remaining failing test on windows only
* Allow file cleanup failures on windows
Disabled auto-caching on dask; new .compute() method
Fixes #902